Having a good education is very important for students and their parents. There are many factors that need to think completely. From the huge number of factors that need to be considered, the graduation rate is always one of the most important indicators for their choices. According to INFORMnny, graduation rates in New York State have grown steadily since 2005 and reached 84.8% of students earned diplomas in 2020. In this report, we are trying to visualize the graduation rate of each counties and races in New York State. Our goal is to find how factors such as different counties, students’ genders, and races influence the graduation rate. The result of our project may perform as a guide for up-coming college students to choose which counties are their best choice based on their goals. Moreover, it may provide more insights for educators to improve the quality of education.
We will use two datasets. The first one is Graduation Rate Database which contains annual graduation rate, and dropout rate data for the state as well as by county, Need to Resource Capacity group, district, public school and charter school. These datasets are published on data.nysed.gov once a year, and we’ll be analyzing the data from 2014 to 2020. The second dataset is the latitude and longitude of the county boundaries in New York State stored in GeoJSON format.
We preprocessed the data before drawing plots. Here’s the link for raw data: graduation rate, county boundaries. Here’s the link for processed data: processed data.
The plot below is the time series of graduation rates by different races, cohorts, and counties. There are two drop-down menus at the bottom of the graph. You can select the different names of counties and cohorts to see different sets of lines. Another interactive tool is tooltip. You can hangover your mouse on the line to show the exact graduation rate of that group. Different colors highlight the difference between races.
# load the data
grad<- read.csv("https://uwmadison.box.com/shared/static/n0paw3xy9sdanw7c06u0tcxvyqh7gwew")
grad$subgroup_code<- as.factor(grad$subgroup_code)
grad$membership_code<- as.factor(grad$membership_code)
# filter out different races
grad1<- grad %>%
filter(subgroup_code == "1"|subgroup_code== "5"|subgroup_code== "6"|subgroup_code== "7"|subgroup_code== "8", grad_pct > 10) %>%
as_tsibble(index = year, key = c("subgroup_code","membership_code", "county_code"))
# convert to a dd/mm/yy format
grad2<- grad1
grad2$year[grad2$year == "2020"]<- "1/1/2020"
grad2$year[grad2$year == "2019"]<- "1/1/2019"
grad2$year[grad2$year == "2018"]<- "1/1/2018"
grad2$year[grad2$year == "2017"]<- "1/1/2017"
grad2$year[grad2$year == "2016"]<- "1/1/2016"
grad2$year[grad2$year == "2015"]<- "1/1/2015"
#write.csv(grad2, "grad2.csv")
For membership code, it refers to the cohort and the school year being reported. 6 - 6-year outcome, June; 8 - 5-year outcome, June ; 9 - 4-year outcome, June; 10 - 5-year outcome, August; 11 - 4-year outcome, August; 18 - 6-year outcome, August.
robservable("https://observablehq.com/@zijianzhang/m3", include = 7, height = 400, width = 900)
Maybe it’s hard to summarize the trend of graduation rate on county level, we can see it on state level instead. (Since 5-year outcome, August and 6-year outcome, August only have two years of data, we omit them.)
grad3<- grad1 %>%
filter(membership_code == "6"|membership_code == "8"|membership_code == "9"|membership_code
== "11") %>%
as_tibble() %>%
group_by(membership_code, subgroup_name, year) %>%
summarise(grad_pct_mean = mean(grad_pct))
grad3$membership_code<- factor(grad3$membership_code, levels = c(9, 11, 8, 6))
viz<- ggplot(grad3) +
geom_line(aes(x = year, y = grad_pct_mean, color = subgroup_name), size = 0.8) +
facet_wrap(~ membership_code,
labeller = as_labeller(c(`6` = "6-year outcome, June",
`8` = "5-year outcome, June",
`9` = "4-year outcome, June",
`11` = "4-year outcome, August"))) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
labs(y = "Average Graduation Rate (%)") +
ggtitle("Average Graduation Rate by Cohorts (NY State level)") +
theme(panel.grid.minor = element_blank(),
legend.position = "bottom")
gp<- ggplotly(viz)
layout_ggplotly <- function(gg, x = -0.02, y = -0.08){
# The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg
}
gp %>% layout_ggplotly(x = -0.05, y = -0.05)
The Choropleth Maps below show the percentage of average values and average standard deviations of graduation rate of each county in New York State respectively. Specifically, the x-axis is the latitude and the y-axis is the longitude, which form maps that can display the coordinates of each county. The colors on the maps represent the quantitative data of the percentage of graduation rate mean and graduation rate standard deviation, the higher value of quantitative data , the darker the color on the map. We can see from the graphs that the percentage of average graduation rate of each county in New York State is different, which may be due to the education level of each county, the strength of the teachers and the students’ own race.
county_mean <- read.csv("https://uwmadison.box.com/shared/static/o1ry3o0jtmxzm8vbbnajitkts8dcnbor.csv")
names(county_mean)[2] <- "name"
boundary <- read_sf("https://uwmadison.box.com/shared/static/kwdzpyq0k6az4gd9sp05r61u8pn67t7r.geojson") %>%
filter(state_name == "New York")
name <- toupper(boundary$name)
boundary$name<- name
county_mean$name<- as.character(county_mean$name)
county_mean[45, 2]<- "ST. LAWRENCE"
df <- left_join(boundary, county_mean, by = "name")
county_sd <- grad %>%
group_by(county_name) %>%
summarise(pct_sd = sd(grad_pct))
names(county_sd)[1] <- "name"
county_sd$name<- as.character(county_sd$name)
county_sd[45, 1]<- "ST. LAWRENCE"
df2<- left_join(df, county_sd, by = "name")
p1 <- ggplot() +
geom_sf(data = df, aes(fill = pct_mean,text=paste('</br>County: ', str_to_title(name),'</br>Average Graduation Rate: ',round(pct_mean,3))), inherit.aes = FALSE) +
coord_sf(crs = st_crs(3857)) +
labs(x = "Longitude", y = "Latitude", fill = "Mean of Graduation Rate") +
ggtitle("Average Graduation Rate for each County") +
theme_bw() +
theme(legend.position = "bottom") +
scale_fill_distiller(palette = "YlOrRd", direction = 1)
ggplotly(,tooltip = "text")
p2 <- ggplot() +
geom_sf(data = df2, aes(fill = pct_sd, text=paste('</br>County: ', str_to_title(name),'</br>Standard Deviation of Graduation Rate: ',round(pct_sd,3))),inherit.aes = FALSE) +
coord_sf(crs = st_crs(3857)) +
#geom_sf_text(data = df, aes(label = str_to_title(name)), check_overlap = T, size = 1.65, col = "blue") +
labs(x = "Longitude", y = "Latitude", fill = "SD of Graduation Rate") +
ggtitle("Standard Deviation of Graduation rate for each County") +
theme_bw() +
theme(legend.position = "bottom") +
scale_fill_distiller(palette = "RdPu", direction = 1)
ggplotly(,tooltip = "text")
set.seed(479)
data=read_csv('grad_1.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_character(),
## X1 = col_double(),
## aggregation_index = col_double(),
## aggregation_code = col_double(),
## county_code = col_double(),
## membership_code = col_double(),
## membership_key = col_double(),
## subgroup_code = col_double(),
## enroll_cnt = col_double(),
## year = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 4917 parsing failures.
## row col expected actual file
## 8172 enroll_cnt a double - 'grad_1.csv'
## 8177 enroll_cnt a double - 'grad_1.csv'
## 8185 enroll_cnt a double - 'grad_1.csv'
## 8186 enroll_cnt a double - 'grad_1.csv'
## 8189 enroll_cnt a double - 'grad_1.csv'
## .... .......... ........ ...... ............
## See problems(...) for more details.
f=function(a){
return(a/100)
}
data2 = data%>%
dplyr::select(matches('pct|subgroup_name|county_name'))%>%
mutate_all(funs(str_replace(.,'%','')))%>%
mutate(across(!c(subgroup_name,county_name),as.numeric))%>%
drop_na()%>%
mutate_at((3:10),f)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
pca_rec <- recipe(~., data = data2) %>%
update_role(subgroup_name,county_name, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
data3=juice(pca_prep)[,1:4]
cluster <- function(data, K) {
x <- data %>%
select(PC1,PC2)
kmeans(x, center = K) %>%
augment(data) %>% # creates column ".cluster" with cluster label
mutate(silhouette = silhouette(as.integer(.cluster), dist(x))[, "sil_width"])
}
result=cluster(data3,4)
data4=cbind(data2,result$.cluster)
colnames(data4)[11]='cluster'
We used grad_pct, local_pct, reg_pct, regadv_pct, non_diploma_credential_pct, still_enr_pct, ged_pct and dropout_pct to make the k-means plot. The meaning of each variable is below:
Graduation percentage: Percent of students in the cohort who earned either a Regents Text 255 or Local diploma.
Local diploma percentage: Percent of students in the cohort who earned a Local diploma.
Regent diploma percentage: Percent of students in the cohort who earned a Regents Text 255 diploma.
Regent diploma with Advanced Designation percentage: Percent of students in the cohort who earned a Regents Text 255 diploma with Advanced Designation.
Non-diploma percentage: Percent of students in the cohort who earned a non-diploma commencement credential (i.e., CDOS credential, Skills & Achievement certificate).
Still enrolled percentage: Percent of students in the cohort who were still enrolled as of Text 255 June 30.
Equivalency percentage: Percent of students in the cohort who entered an approved Text 255 high school equivalency preparation program.
Drop out percentage: Percent of students who dropped out.
Firstly, we dropped the rows with NA and did PCA on these variables. Secondly, we used PC1 and PC2 to train the K-means model. We set K at 4 because we found that this value can separate the data best. Then we made plots among the subgroups, which can show the difference of cluster distribution between different subgroups. The table below shows the average values of the variables of each cluster.
library("readr")
library("tidytext")
components_ <- pca_prep %>%
tidy(2) %>%
filter(component %in% str_c("PC", 1:2)) %>%
mutate(terms = reorder_within(terms, abs(value), component))
components_[,1]=rep(c('Graduation percentage','Local diploma percentage','Regent diploma percentage','Regent diploma with Advanced Designation percentage','Non-diploma percentage','Still enrolled percentage','Equivalency percentage','Drop out percentage'),2)
ggplot(components_, aes(value, terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ component, scales = "free_y") +
labs(y = NULL) +
ggtitle("Components of Principle Component 1 and Principle Component 2") +
theme_bw() +
theme(legend.text=element_text(size=10),
panel.grid.minor = element_blank(),
legend.position = "bottom",
plot.title = element_text(hjust = 0.5))
Interpretation: For principle component 1, graduation percentage contributes the most, but the difference between it and the others is not big. However, Regent diploma with Advanced Designation percentage and Regent diploma percentage contribute much more that the others in principle component 2.
t1=data4 %>%
group_by(cluster) %>%
summarise(mean(grad_pct))
t2=data4%>%
group_by(cluster)%>%
summarise(mean(local_pct))
t3=data4%>%
group_by(cluster)%>%
summarise(mean(reg_pct))
t4=data4%>%
group_by(cluster)%>%
summarise(mean(reg_adv_pct))
t5=data4%>%
group_by(cluster)%>%
summarise(mean(non_diploma_credential_pct))
t6=data4%>%
group_by(cluster)%>%
summarise(mean(still_enr_pct))
t7=data4%>%
group_by(cluster)%>%
summarise(mean(ged_pct))
t8=data4%>%
group_by(cluster)%>%
summarise(mean(dropout_pct))
colnames(t1)[2]='value'
colnames(t2)[2]='value'
colnames(t3)[2]='value'
colnames(t4)[2]='value'
colnames(t5)[2]='value'
colnames(t6)[2]='value'
colnames(t7)[2]='value'
colnames(t8)[2]='value'
variable=c(rep('grad_pct',4),rep('local_pct',4),rep('reg_pct',4),rep('reg_adv_pct',4),rep('non_diploma_credential_pct',4),rep('still_enr_pct',4),rep('ged_pct',4),rep('dropout_pct',4))
t=cbind(rbind(t1,t2,t3,t4,t5,t6,t7,t8),variable) %>%
pivot_wider(id_cols = cluster, names_from = variable, values_from = value)
colnames(t) = c('Cluster','Graduation percentatge','Local diploma percentage','Regent diploma percentage','Regent diploma with Advanced Designation percentage','Non-diploma percentage','Still enrolled percentage','Equivalency percentage','Drop out percentage')
library(knitr)
kable(t)
| Cluster | Graduation percentatge | Local diploma percentage | Regent diploma percentage | Regent diploma with Advanced Designation percentage | Non-diploma percentage | Still enrolled percentage | Equivalency percentage | Drop out percentage |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.4962500 | 0.2154416 | 0.2519124 | 0.0292590 | 0.0742328 | 0.1601984 | 0.0136639 | 0.2533945 |
| 2 | 0.9013773 | 0.0441703 | 0.3775606 | 0.4798053 | 0.0082267 | 0.0327387 | 0.0045398 | 0.0514700 |
| 3 | 0.8653535 | 0.0641427 | 0.5240293 | 0.2772371 | 0.0123757 | 0.0386491 | 0.0059942 | 0.0757503 |
| 4 | 0.7180578 | 0.1023912 | 0.4729104 | 0.1429690 | 0.0207496 | 0.0895510 | 0.0124029 | 0.1572458 |
result %>% dplyr::filter(grepl('All|American|Asian|Black|Hispanic|Multiracial|White', subgroup_name)) %>%
ggplot(aes(col = .cluster)) +
geom_point(aes(x=PC1,y=PC2),size=0.3) +
facet_wrap(.~subgroup_name)+
scale_color_brewer(palette = "Set2") +
labs(x = 'Principle Component 1',
y = 'Principle Component 2',
title = 'K-means Plot Group By Race',
color = 'Cluster')+
theme_bw() +
theme(legend.text=element_text(size=10),
panel.grid.minor = element_blank(),
legend.position = "bottom",
plot.title = element_text(hjust = 0.5))+
guides(colour = guide_legend(override.aes = list(size=4)))
We can see the race and graduation rate do have correlation and colored race tends to have lower graduation rate because they have more points in cluster 2.
result %>% dplyr::filter(grepl('All|Econo', subgroup_name)) %>%
ggplot(aes(col = .cluster)) +
geom_point(aes(x=PC1,y=PC2),size=0.3) +
facet_wrap(.~subgroup_name)+
scale_color_brewer(palette = "Set2") +
labs(x = 'Principle Component 1',
y = 'Principle Component 2',
title = 'K-means Plot Group By Economic Situation',
color = 'Cluster')+
theme_bw() +
theme(legend.text=element_text(size=10),
panel.grid.minor = element_blank(),
legend.position = "bottom",
plot.title = element_text(hjust = 0.5))+
guides(colour = guide_legend(override.aes = list(size=4)))
The economically disadvantaged students have less points in cluster 3 which has the best performance in all the indexes above.
There is a strong relationship between English language skill and the graduation rate.
Students in foster care have lower graduation rate than the others.
Students that are migrants still have lower graduation rate than the others.
Homeless students have lower graduation rate than the others.
We don’t see significant difference between the two groups.
Almost all the disabled students are put into cluster 1 and 2 which have the lowest graduation rate and the highest dropout rate.
For plot 1:
In general, we could find that New York State’s graduation rate has slowly increased between 2015 and 2020, which represents a steady increase in the level of education of New York State during recent years with the development of society as a whole.
The time series of different counties varied greatly. Graduation rates are gradually increasing from 2014 to 2020.
For all kinds of cohorts, it is a little bit surprised that Asian/Pacific Islander have the highest graduation rates among all groups; white people having slightly lower rates. Both are above average. Hispanics and African-Americans are below average, but it is hard to distinguish these two groups. This result confirms the fact that different races have unequal educational opportunities
The shapes of the time series of different cohorts are roughly the same.
For plot 2:
Looking specifically at Choropleth Maps, we could see that Westchester and Suffolk have the highest graduation rates among all counties, which is 80.69% and 80.56% respectively,but their average standard deviation is the lowest around all the counties.
Generally, counties with higher graduation rates will have lower standard deviation of graduation rates. It shows that education inequalities are greatr in counties with low graduation rates than in counties with high graduation rates.
For plot 3:
The plots can show the difference between the subgroups. Race, homelessness, economic situation, disability, language skill and other factors showed in the plots have significant correlation with the students performance at school. More researches are needed to explore how these factors influence the students and find ways to improve such situation.
In terms of personal factors, we conclude that the graduation rate is related to factors such as race, homelessness, financial status, disability, language skills and other factors. Asian/Pacific Islander has a higher graduation rate than other races, while the graduation rate of blacks, Hispanics and African-Americans among all races is relatively low. This may mean that different races have different educational opportunities, or that New York State’s educational methods are more suitable for Asian/Pacific Islanders, so for them, New York State’s high schools could provide them with good educational opportunities. For students with poor English (probably international students), financial burden, migration, foster care, homelessness, and disabilities, the probabilities of graduating for them are relatively low. Most schools in counties of New York State may not have developed a plan for such diverse students to help them graduate better now.
#Look at one dataset to know the structure and detect NAs
grad20<- read.csv("https://uwmadison.box.com/shared/static/rdyzsbt2pfwqzbhtq31ks5fw0644xc9c.csv")
skim(grad20)
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
## Warning in sorted_count(x): Variable contains value(s) of "" that have been
## converted to "empty".
| Name | grad20 |
| Number of rows | 227450 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| factor | 26 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| report_school_year | 0 | 1 | FALSE | 1 | 201: 227450 |
| aggregation_type | 0 | 1 | FALSE | 5 | Sch: 144138, Dis: 74012, Cou: 8168, Nee: 988 |
| aggregation_name | 0 | 1 | FALSE | 2104 | NEW: 328, BEA: 242, EAR: 236, PAT: 230 |
| entity_inactive_date | 0 | 1 | FALSE | 5 | emp: 226774, 06/: 242, 06/: 232, 06/: 110 |
| lea_name | 0 | 1 | FALSE | 695 | emp: 90094, NEW: 8396, NEW: 3516, NEW: 2686 |
| nrc_desc | 0 | 1 | FALSE | 8 | Ave: 73658, NYC: 60742, Rur: 30180, Low: 28312 |
| county_name | 0 | 1 | FALSE | 63 | KIN: 19740, BRO: 17350, NEW: 17090, SUF: 13854 |
| boces_name | 0 | 1 | FALSE | 39 | NEW: 67024, NAS: 11866, emp: 9300, EAS: 9098 |
| membership_desc | 0 | 1 | FALSE | 6 | 201: 38211, 201: 38211, 201: 37824, 201: 37824 |
| subgroup_name | 0 | 1 | FALSE | 24 | All: 12358, Not: 12358, Par: 12358, Not: 12352 |
| grad_cnt | 0 | 1 | FALSE | 3450 | -: 60554, 5: 3210, 6: 2785, 7: 2613 |
| grad_pct | 0 | 1 | FALSE | 102 | -: 60554, 100: 17343, 96%: 8350, 97%: 8251 |
| local_cnt | 0 | 1 | FALSE | 987 | -: 60554, 0: 37547, 1: 21298, 2: 15694 |
| local_pct | 0 | 1 | FALSE | 96 | -: 60554, 0%: 39517, 3%: 10752, 2%: 10582 |
| reg_cnt | 0 | 1 | FALSE | 2692 | -: 60554, 4: 4202, 5: 4179, 3: 4129 |
| reg_pct | 0 | 1 | FALSE | 102 | -: 60554, 50%: 5278, 40%: 4096, 45%: 3975 |
| reg_adv_cnt | 0 | 1 | FALSE | 1810 | -: 60554, 0: 32435, 1: 8086, 2: 6385 |
| reg_adv_pct | 0 | 1 | FALSE | 102 | -: 60554, 0%: 32593, 33%: 2934, 29%: 2736 |
| non_diploma_credential_cnt | 0 | 1 | FALSE | 329 | 0: 111899, -: 60554, 1: 23550, 2: 10539 |
| non_diploma_credential_pct | 0 | 1 | FALSE | 55 | 0%: 118818, -: 60554, 1%: 18886, 2%: 10217 |
| still_enr_cnt | 0 | 1 | FALSE | 1015 | 0: 64067, -: 60554, 1: 28511, 2: 16497 |
| still_enr_pct | 0 | 1 | FALSE | 100 | 0%: 67214, -: 60554, 1%: 19327, 2%: 15891 |
| ged_cnt | 0 | 1 | FALSE | 283 | 0: 116670, -: 60554, 1: 21502, 2: 9200 |
| ged_pct | 0 | 1 | FALSE | 41 | 0%: 121942, -: 60554, 1%: 17915, 2%: 10423 |
| dropout_cnt | 0 | 1 | FALSE | 1118 | -: 60554, 0: 34184, 1: 25065, 2: 17932 |
| dropout_pct | 0 | 1 | FALSE | 92 | -: 60554, 0%: 36283, 2%: 11400, 3%: 11289 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| aggregation_index | 0 | 1.00 | 3.590000e+00 | 5.900000e-01 | 0 | 3.00000e+00 | 4.000e+00 | 4.00000e+00 | 4 | ▁▁▁▅▇ |
| aggregation_code | 0 | 1.00 | 3.489146e+11 | 1.782793e+11 | 0 | 2.51101e+11 | 3.313e+11 | 4.91200e+11 | 680801040001 | ▃▃▇▃▅ |
| lea_beds | 90094 | 0.60 | 3.512888e+11 | 1.673100e+11 | 10100010000 | 2.80218e+11 | 3.313e+11 | 4.41202e+11 | 800000090459 | ▃▇▆▃▂ |
| nrc_code | 8312 | 0.96 | 3.770000e+00 | 1.950000e+00 | 1 | 1.00000e+00 | 5.000e+00 | 5.00000e+00 | 7 | ▇▁▃▇▅ |
| county_code | 1132 | 1.00 | 3.496000e+01 | 1.771000e+01 | 1 | 2.60000e+01 | 3.300e+01 | 4.90000e+01 | 68 | ▃▂▇▃▅ |
| nyc_ind | 1132 | 1.00 | 3.000000e-01 | 4.600000e-01 | 0 | 0.00000e+00 | 0.000e+00 | 1.00000e+00 | 1 | ▇▁▁▁▃ |
| boces_code | 9300 | 0.96 | 3.434300e+03 | 1.699020e+03 | 190 | 2.49000e+03 | 3.090e+03 | 4.59000e+03 | 6690 | ▂▅▇▃▃ |
| membership_code | 0 | 1.00 | 1.033000e+01 | 3.760000e+00 | 6 | 8.00000e+00 | 9.500e+00 | 1.10000e+01 | 18 | ▇▇▅▁▃ |
| membership_key | 0 | 1.00 | 1.550000e+02 | 2.580000e+00 | 152 | 1.53000e+02 | 1.545e+02 | 1.56000e+02 | 160 | ▇▇▅▁▃ |
| subgroup_code | 0 | 1.00 | 1.204000e+01 | 7.190000e+00 | 1 | 6.00000e+00 | 1.100e+01 | 1.80000e+01 | 25 | ▇▇▇▆▆ |
| enroll_cnt | 0 | 1.00 | 2.728900e+02 | 3.442000e+03 | 1 | 1.00000e+01 | 4.900e+01 | 1.21000e+02 | 210130 | ▇▁▁▁▁ |
gg_miss_upset(grad20)
## clean dataset for each year and select county (may take a long time to read...)
grad20<- grad20 %>%
filter(aggregation_index == "2") %>%
dplyr::select(-entity_inactive_date:-nrc_desc, -nyc_ind:-boces_name) %>%
mutate(year = "2020")
grad19<- read.csv("https://uwmadison.box.com/shared/static/ceqk6p10j1jq060l23jmibs3fr9vryq4.csv") %>%
filter(aggregation_index == "2") %>%
dplyr::select(-entity_inactive_date:-nrc_desc, -nyc_ind:-boces_name)%>%
mutate(year = "2019")
grad18<- read.csv("https://uwmadison.box.com/shared/static/cfo64w5y2lo7l5tgxsrgymahnwmnr8p2.csv") %>%
filter(AGGREGATION_INDEX == "2") %>%
dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
mutate(year = "2018")
name<- tolower(colnames(grad18))
colnames(grad18)<- name
grad17<- read.csv("https://uwmadison.box.com/shared/static/qeo58dsbhzj7wz4fn0kr0ejlnygvlvj9.csv") %>%
filter(AGGREGATION_INDEX == "2") %>%
dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
mutate(year = "2017")
name<- tolower(colnames(grad17))
colnames(grad17)<- name
grad16<- read.csv("https://uwmadison.box.com/shared/static/tokjrznpqbp343elxbotdwr2guzqrtkn.csv") %>%
filter(AGGREGATION_INDEX == "2") %>%
dplyr::select(-ENTITY_INACTIVE_DATE:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
mutate(year = "2016")
name<- tolower(colnames(grad16))
name[1]<- "report_school_year"
colnames(grad16)<- name
grad15<- read.csv("https://uwmadison.box.com/shared/static/cu5yyzj3py2cy342e6erto9a15qjglu4.csv") %>%
filter(AGGREGATION_INDEX == "2") %>%
dplyr::select(-LEA_BEDS:-NRC_DESC, -NYC_IND:-BOCES_NAME) %>%
mutate(year = "2015")
name<- tolower(colnames(grad15))
colnames(grad15)<- name
grad<- rbind(grad20, grad19, grad18, grad17, grad16, grad15)
# combine above datasets, convert to numeric, remove NAs and "-"
grad %>%
mutate_at(c("grad_cnt","local_cnt","reg_cnt","reg_adv_cnt","non_diploma_credential_cnt","still_enr_cnt","ged_cnt","dropout_cnt"),funs(gsub("-", NA, ., fixed = TRUE))) %>%
mutate_at(c("grad_pct","local_pct","reg_pct","reg_adv_pct","non_diploma_credential_pct","still_enr_pct","ged_pct","dropout_pct"),funs(as.numeric(gsub("%", "", ., fixed = TRUE))/100)) %>%
drop_na()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
cols.num <- c(seq(13, 30, 1))
grad[cols.num] <- sapply(grad[cols.num],as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
# unified the name of categories
grad[grad == "Asian or Pacific Islander"]<- "Asian/Pacific Islander"
grad[grad == "Hispanic or Latino"]<- "Hispanic"
grad[grad == "American Indian or Alaska Native"]<- "American Indian/Alaska Native"
grad[grad == "Black or African American"]<- "Black"
# convert it to a tsibble object
grad_tsibble<- as_tsibble(grad, index = year, key = c("subgroup_code","membership_code", "county_code"))
write.csv(grad_tsibble, "grad.csv")
# average graudation rate for each county
county_grad <- grad %>%
group_by(county_name) %>%
summarise(pct_mean = mean(grad_pct))